### Make sure all the packages are installed
if (!requireNamespace("Seurat", quietly = TRUE))
install.packages("Seurat")
if (!requireNamespace("tidyverse", quietly = TRUE))
install.packages("tidyverse")
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
if (!requireNamespace("harmony", quietly = TRUE))
devtools::install_github("immunogenomics/harmony")
if (!requireNamespace("colorBlindness", quietly = TRUE))
install.packages("colorBlindness")
if (!requireNamespace("DT", quietly = TRUE))
install.packages("DT")
if (!requireNamespace("scales", quietly = TRUE))
install.packages("scales")
if (!requireNamespace("ggraph", quietly = TRUE))
install.packages("ggraph")
if (!requireNamespace("tidygraph", quietly = TRUE))
install.packages("tidygraph")
if (!requireNamespace("ggforce", quietly = TRUE))
install.packages("ggforce")
if (!requireNamespace("ggalluvial", quietly = TRUE))
install.packages("ggalluvial")
if (!requireNamespace("corrplot", quietly = TRUE))
install.packages("corrplot")
if (!requireNamespace("stats", quietly = TRUE))
install.packages("stats")
if (!requireNamespace("fastDummies", quietly = TRUE))
install.packages("fastDummies")
### Load all the necessary libraries
library(Seurat)
library(tidyverse)
library(devtools)
library(harmony)
library(colorBlindness)
library(DT)
library(scales)
library(ggraph)
library(tidygraph)
library(RColorBrewer)
library(scales)
library(colorRamps)
library(ggforce)
library(ggalluvial)
library(corrplot)
library(stats)
library(fastDummies)
set.seed(687)4 - PCA, Harmony Integration, and kNN Graphs
Introduction
This vignette is a hands-on guide to understanding and applying principal component analysis (PCA), k-nearest neighbors (kNN) graph creation, and harmony integration in the context of scRNAseq data analysis. PCA and Harmony are two common methods for creating a dimensionality-reduced “latent space” from scRNAseq cell x gene matrices. “Latent space” is the term used to describe any matrix that can be calculated with a reduced size while still preserving the relationships between samples of the original matrix. PCA creates the latent space by deriving the combinations of genes that describe the greatest variance across samples in the dataset. Harmony creates a latent space that reduces a batch effect by reducing differences between batches within cluster or latent space variables separately. We calculate the final kNN graph of relationships between cells based on a latent space, because the process of creating the graph scales multiplicatively with number of input features.
Vocabulary
Latent Space
Any matrix that can be calculated from a data matrix with a reduced size while still preserving the relationships between samples of the original matrix. Latent means “existing as potential”, as in “latent abilities” - so a latent space is a hidden set of dimensions in the data that could be useful in ways the raw data is not.
Feature
A variable in a sample x variable matrix. In a cell x gene matrix, the gene is the feature. In PCA, the principle components are the features.
Dimensionality Reduction
The computational process of reducing a large data matrix to fewer features. For example, in scRNAseq, PCA is the most popular dimensionality reduction technique. With it we can reduce our features from ~30,000 genes to ~30 principle components, reducing the computational workload of complex analysis tasks, like creating the nearest neighbors graph.
Embeddings
This is another term for a latent space, but used to refer to what we have done to the cells. The two terms are most often interchangeable. This term implies the “embedding” or placement of the cells in a different feature space.
Loadings
This is the term used to describe the feature translation matrix. Which genes are in which PC? The Loadings matrix shows the weights of each gene in each PC. A negative value in this matrix means expression of the gene pushes cells in the negative direction of the PC, a positive value pushes the cell in the positive direction.
Additional Reading
Principal Component Analysis (PCA)
- A short synopsis of PCA by Josh Starmer, aka StatQuest - This video uses scRNAseq data to explain some ideas behind PCA
- Understanding PCA by Victor Powell - a visually interactive explanation of the algorithm behind PCA.
- Jolliffe, I.T., & Cadima, J. (2016). Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A, 374(2065), 20150202. read here
Harmony Integration
- Harmony’s Reference Website
- Harmony Github.
- Korsunsky, I., Millard, N., Fan, J., et al. (2019). Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 16, 1289–1296. read here
- A detailed walkthrough of the harmony algorithm
k-Nearest Neighbors (kNN) Graphs
- A nice introduction to kNN graphs can be found at kNN Visualization. We recommend this site in general for learning about algorithms in the context of single-cell RNA sequencing!
Key Takeaways
- PCA and other latent-space algorithms are essential in the scRNAseq workflow for dimensionality reduction. PCA is especially useful because it helps to visualize and interpret major trends and variability in the data. PCA output:
- a Cell x PC matrix; a new set of variables that combine the most covarying sets of genes in the dataset, for example M phase vs. G1 phase, Immune vs. Epithelial, Sick vs. Healthy.
- a PC x Gene matrix; the loadings or weights of each gene’s importance in each PC
- a Variance Explained list; the standard deviation in the dataset captured by each PC
- kNN graphs define neighborhoods of cells based on similarity. This is the ultimate goal of scRNAseq data processing, because it allows us to identify clusters of cells. kNN output:
- a Neighborhood x Cell matrix; A matrix showing which cells are in which neighborhoods
- Harmony integration minimizes batch effects by regularizing batches in any specified latent space. In benchmarking studies, it is often highlighted as the best integration method, balancing the removal of technical noise with the maintenance of known biological signals, like cell cycle. Like PCA, Harmony creates a latent space that can be visualized. Harmony output:
- a matrix of harmony integrated PCs per cell
- These methods are not only relevant in single-cell analysis but also applicable across various fields of data science and can be useful for analysis of any complex dataset.
Libraries
Load data
We’re going to be working with a dataset from the paper - Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19 Download data from the cellxgene portal.
se <- readRDS("../data/Covid_Flu_Seurat_Object.rds")Color palette
donor_pal <- c(
"#66C2A4", "#41AE76", "#238B45", "#006D2C",
"#41B6C4", "#1D91C0", "#225EA8", "#253494", "#081D58",
"#FFEDA0", "#FED976", "#FEB24C", "#FD8D3C",
"#FC4E2A", "#E31A1C", "#BD0026", "#800026")
names(donor_pal) <- c(
"Normal 1", "Normal 2", "Normal 3", "Normal 4",
"Flu 1", "Flu 2", "Flu 3", "Flu 4", "Flu 5",
"nCoV 1", "nCoV 2", "nCoV 3_4", "nCoV 5",
"nCoV 6", "nCoV 7_8", "nCoV 9_10", "nCoV 11"
)To get up to speed with the previous worksheets, process the data in the same way.
se <- se %>%
NormalizeData(verbose = FALSE) %>%
FindVariableFeatures(
method = "vst",
nfeatures = 3000,
verbose = FALSE) %>%
ScaleData(verbose = FALSE, features = VariableFeatures(.))Analysis
PCA
When we run PCA on our Seurat object using the Seurat package’s implementation, PCA is calculated on the scaled, highly variable genes. The resulting PCA latent space is placed in the reductions slot.
se <- se %>%
RunPCA(
npcs = 30,
ndims.print = 1:5,
nfeatures.print = 10
)
# This is where latent spaces, aka dimensional reductions are stored. If we call it directly, we can see some information about the PCA.
se@reductions$pcaA dimensional reduction object with key PC_
Number of dimensions: 30
Number of cells: 59572
Projected dimensional reduction calculated: FALSE
Jackstraw run: FALSE
Computed using assay: RNA
Accessing PCA results
The PC x Cell latent space is stored in the cell.embeddings slot. The Gene x PC matrix of gene weights per PC is stored in the feature.loadings slot. The variance explained per PC is placed in the stdev slot
DT::datatable(se@reductions$pca@cell.embeddings)